Skip to content

parquet: Add tests for disabling statistics on multiple columns#16639

Open
algojogacor wants to merge 1 commit into
apache:mainfrom
algojogacor:fix/multi-column-stats-disabled
Open

parquet: Add tests for disabling statistics on multiple columns#16639
algojogacor wants to merge 1 commit into
apache:mainfrom
algojogacor:fix/multi-column-stats-disabled

Conversation

@algojogacor
Copy link
Copy Markdown

This is an independent contribution and is not associated with any hackathon or competition.

Summary

Adds two new test methods to TestParquet.java covering the edge case where per-column statistics are disabled on more than one column simultaneously.

Root Cause

The existing test testColumnStatisticsEnabled only validates the single-column-disabled scenario. Issue #15347 reported that when statistics are disabled on multiple columns (e.g., write.parquet.stats-enabled.column.foo=false and write.parquet.stats-enabled.column.bar=false), the second column may still write statistics. These tests provide coverage for that scenario.

Changes

  • testColumnStatisticsDisabledMultipleColumns: Schema with 3 columns (int, string, double); stats enabled on col 1, disabled on cols 2 and 3. Verifies both disabled columns correctly omit statistics.
  • testColumnStatisticsDisabledAllColumns: Schema with 2 columns; stats disabled on both. Verifies every column omits statistics.

Testing

Both new tests follow the same pattern as the existing testColumnStatisticsEnabled: write Parquet data with the given properties, read back the footer, and assert that each ColumnChunkMetaData has the expected statistics state (empty or non-empty).

Add two new test cases to verify per-column statistics configuration
works correctly when stats are disabled on more than one column:

- testColumnStatisticsDisabledMultipleColumns: verifies that when stats
  are enabled on one column and disabled on two others, both disabled
  columns correctly omit statistics in the Parquet file.

- testColumnStatisticsDisabledAllColumns: verifies that when stats are
  disabled on every column, all columns correctly omit statistics.

These tests address the edge case reported in issue apache#15347, where
disabling statistics across multiple columns could result in some
columns still writing statistics.
Comment on lines +309 to +322
try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(inputFile))) {
for (BlockMetaData block : reader.getFooter().getBlocks()) {
for (ColumnChunkMetaData column : block.getColumns()) {
boolean emptyStats = column.getStatistics().isEmpty();
if (column.getPath().toDotString().equals("int_field")) {
assertThat(emptyStats).as("int_field has statistics").isEqualTo(false);
} else if (column.getPath().toDotString().equals("string_field")) {
assertThat(emptyStats).as("string_field has statistics disabled").isEqualTo(true);
} else if (column.getPath().toDotString().equals("double_field")) {
assertThat(emptyStats).as("double_field has statistics disabled").isEqualTo(true);
}
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readability of this logic seems a bit low. Could you consider introducing a helper method?

    try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(inputFile))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        assertStats(block, "int_field", false);
        assertStats(block, "string_field", true);
        assertStats(block, "double_field", true);
      }
    }
...
  private static void assertStats(BlockMetaData rowGroup, String columnName, boolean empty) {
    ColumnPath columnPath = ColumnPath.fromDotString(columnName);
    ColumnChunkMetaData chunkMetaData =
        getOnlyElement(
            rowGroup.getColumns().stream()
                .filter(column -> columnPath.equals(column.getPath()))
                .toList());
    assertThat(chunkMetaData.getStatistics().isEmpty()).isEqualTo(empty);
  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants